DATA MANAGEMENT APPLICATION PROGRAMMING INTERFACE FOR A

PARALLEL FILE SYSTEM 

CROSS-REFERENCE TO RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional
Patent Application No. 60/214,127, filed June 26, 2000.
It is related to four other U.S. patent applications,
filed on even date, entitled "Data Management Application
Programming Interface Session Management for a Parallel
File System"; "Implementing Data Management Application
Programming Interface Access Rights in a Parallel File
System"; "Data Management Application Programming
Interface Handling Mount on Multiple Nodes in a Parallel
File System"; and "Data Management Application
Programming Interface Failure Recovery in a Parallel File
System." All of these related applications are assigned
to the assignee of the present patent application and are
incorporated herein by reference.

FIELD OF THE INVENTION 

The present invention relates generally to computer file
systems, and specifically to implementation of data
management applications in parallel file systems.

BACKGROUND OF THE INVENTION 

A wide variety of data management (DM) applications have
been developed to supplement the basic file storage and
retrieval functions offered by most operating system (OS)
kernels. Typical DM applications include hierarchical
storage management (also known as data migration),
unattended data backup and recovery, on-line encryption
and compression, and directory browsers.

IL9-2000-0067US1 1 






These applications, which extend the basic OS kernel 
functions, are characterized by the need for monitoring 
and controlling the use of files in ways that ordinary 
user applications do not require. 

In response to this need, the Data Management Interfaces
Group (DMIG) was formed by a consortium of UNIX® software
vendors to develop a standard Data Management Application
Programming Interface (DMAPI). DMAPI provides a
consistent, platform-independent interface for DM
applications, allowing DM applications to be developed in
much the same way as ordinary user applications. By
defining a set of standard interface functions to be
offered by different OS vendors, DMAPI gives DM software
developers the tools they need for monitoring and
controlling file use, without requiring them to modify
the OS kernel. DMAPI is described in detail in a
specification document published by the Open Group
(www.opengroup.org), entitled "Systems Management: Data
Storage Management (XDSM) API" (Open Group Technical
Standard, 1997), which is incorporated herein by
reference. This document is available at
www.opengroup.org.

As noted in the XDSM specification, one of the basic
foundations of DMAPI is "events." In the event paradigm,
the OS informs a DM application running in user space
whenever a particular, specified event occurs, such as a
user application request to read a certain area of a
file. The event may be defined (using DMAPI) as
"synchronous," in which case the OS will notify the DM
application of the event and will wait for its response
before proceeding, or as "asynchronous," in which case OS
processing continues after notifying the DM application
of the event. The area of a file with respect to which
certain events are defined is known as a "managed
region."
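
By way of illustration only, the difference between the two
event types may be sketched in Python as follows. The sketch
does not use the actual DMAPI; the names deliver_event and
dm_app and the "continue"/"abort" responses are hypothetical
and were chosen solely for this example.

```python
# Hypothetical model of the event paradigm; not the real DMAPI.

def deliver_event(event, synchronous, dm_app, log):
    """Notify the DM application of an event raised by a file operation.

    For a synchronous event the OS waits for the application's response
    before deciding whether to proceed. For an asynchronous event the OS
    notifies the application and proceeds regardless of any response.
    (A real system would queue an asynchronous message rather than call
    the application inline; the inline call keeps the sketch short.)
    """
    response = dm_app(event)
    if synchronous:
        proceed = (response == "continue")
    else:
        proceed = True  # the response, if any, is ignored
    log.append((event, "proceeded" if proceed else "aborted"))
    return proceed


def dm_app(event):
    # Illustrative policy: refuse reads of a managed region named "offline".
    return "abort" if event == ("read", "offline") else "continue"


log = []
sync_result = deliver_event(("read", "offline"), True, dm_app, log)
async_result = deliver_event(("read", "offline"), False, dm_app, log)
```

In this sketch the same read request is held back when the
event is synchronous and allowed to continue when it is
asynchronous.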

Another fundamental concept in DMAPI is a "token,"
which is a reference to a state that is associated with a
synchronous event message. The state typically includes
lists of files affected by the event and DM access rights
in force for those files. The token may be passed from
thread to thread of the DM application and provides a
convenient means for referencing the state. Access
rights may either be shared with other processes (in
which case they are read-only rights), or they may be
exclusive (read-write) rights.

Communications between DM applications and the OS are
session-based. The DM application creates the session by
an appropriate DMAPI function call
(dm_create_session()). The application then registers
event dispositions for the session, indicating which
event types in a specified file system should be
delivered to the session. Multiple sessions may exist
simultaneously, and events in a given file system may be
delivered to any of these sessions.

The DMAPI standard, having grown out of the needs of
UNIX system vendors, is based on the notion of a single
system environment, using a single computing node. DMAPI
implementations have also been developed for distributed
file systems, which allow a user on a client computer
connected to a network to access and modify data stored
in files on a file server. When a user accesses data on
the file server, a copy of the data is stored, or cached,
on the client computer, and the user can then read and
modify the copy. When the user is finished, the modified
data are written back to the file server. Examples of
distributed file systems include Sun Microsystems'
Network File System (NFS™), Novell Netware™, Microsoft's
Distributed File System, and IBM/Transarc's DFS™.
Transarc Corporation (Pittsburgh, Pennsylvania) has
developed a DMAPI implementation for its DFS called
DMEpi. All of these distributed file systems, however,
are still essentially single-node systems, in which a
particular server controls any given file. The DMAPI and
data management applications for such distributed file
systems are essentially server functions and are not
distributed among the client nodes.

IBM's General Parallel File System (GPFS) is a
UNIX-style file system designed for IBM RS/6000
multiprocessor computing platforms, such as the SP™ and
HACMP™ systems. GPFS, which runs in the AIX® operating
system, allows applications on multiple nodes to share
file data, without mediation of a file server as in
distributed file systems. GPFS is described, for
example, in a publication entitled "General Parallel File
System for AIX: Concepts, Planning and Installation,"
which is available at
www.rs6000.ibm.com/resource/aix_resource/sp_books/gpfs.
GPFS supports very large file systems and stripes data
across multiple disks for higher performance. GPFS is
based on a shared disk model that provides low-overhead
access to disks not directly attached to the application
nodes and uses a distributed locking protocol to provide
full data coherence for access from any node. These
capabilities are available while allowing high-speed
access to the same data from all nodes of the system.
GPFS has failure recovery capabilities, allowing
applications to continue running even when node or
network component failures occur.

A series of patents to Schmuck et al. describe
aspects of a shared parallel disk file system that are
implemented in GPFS. These patents include U.S.
5,893,086; U.S. 5,940,838; U.S. 5,963,963; U.S.
5,987,477; U.S. 5,999,976; U.S. 6,021,508; U.S.
6,023,706; and U.S. 6,032,216, all of whose disclosures
are incorporated herein by reference.

SUMMARY OF THE INVENTION

Preferred embodiments of the present invention
provide a DMAPI that is suitable for use in a multi-node,
parallel computing environment, and specifically for use
with parallel file systems. Implementing DMAPI in a
parallel file system, such as the above-mentioned GPFS,
requires enhancements to the functions defined in the
XDSM standard and alterations in certain basic
definitions and assumptions that underlie DMAPI
implementations known in the art. The basic semantics
and functionality of the standard DMAPI model, however,
are preferably preserved in the parallel system. DM
application programmers are thus enabled to integrate
data migration and other DM applications with the
parallel file system in an immediate and straightforward
manner.

In preferred embodiments of the present invention,
computing nodes in a cluster are mutually linked by a
suitable interconnection to a set of one or more block
storage devices, typically disks. A parallel file system
is configured so that all nodes in the cluster can mount
the same set of file system instances. File data and
metadata, on multiple logical volumes, may reside at
different nodes. All of the volumes are accessible from
all of the nodes via a shared disk mechanism, whereby the
file data can be accessed in parallel by multiple tasks
running on multiple nodes. The enhanced DMAPI provided
for the parallel file system is used to support DM
functions, such as automatic data migration, over all of
the nodes and storage volumes in the cluster.

DM applications may run on substantially any of the
nodes in the cluster, as either single-node or multi-node
parallel applications. The DM application preferably
starts by creating a session on one of the nodes and
specifying the DM events that are to be reported to the
session. The node on which the session is created is
designated as the session node, and all specified events
generated by file system operations are reported to the
session node, regardless of the node at which the events
are generated. Thus, an event may be generated by a file
operation on one of the nodes, referred to herein as the
source node, and delivered to a session on a different
node, i.e., the session node. If the event is a
synchronous event, requiring a response from the DM
application before the file operation can continue, the
source node will wait to carry out the requested file
operation until the session node has sent its response
back to the source node. In contrast, in DMAPI
implementations known in the art all events and sessions
take place on a single node.
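
By way of illustration only, the reporting of events from a
source node to the session node may be modeled in Python as
shown below. The Cluster class, its method names, and the
node and path names are hypothetical; they are not part of
DMAPI or of any particular parallel file system.

```python
# Hypothetical model: events generated on any node are reported to the
# session node chosen when the DM application created its session.

class Cluster:
    def __init__(self, node_names):
        self.inbox = {name: [] for name in node_names}  # per-node event queues
        self.session_node = None
        self.enabled = set()

    def create_session(self, node, event_types):
        """Create the DM session on `node` and record the events it wants."""
        self.session_node = node
        self.enabled = set(event_types)

    def file_operation(self, source_node, event_type, path):
        """Run on the source node; report the event to the session node."""
        if event_type not in self.enabled:
            return None  # no DM event is generated for this operation
        message = {"type": event_type, "path": path, "source": source_node}
        self.inbox[self.session_node].append(message)
        return message


cluster = Cluster(["node1", "node2", "node3"])
cluster.create_session("node1", {"read", "write"})
cluster.file_operation("node2", "read", "/fs/a")      # remote source node
cluster.file_operation("node1", "write", "/fs/b")     # source is the session node
cluster.file_operation("node3", "truncate", "/fs/c")  # event type not specified
```

Both specified events arrive in the queue of node1, whichever
node generated them, and each message records its source node.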

There is therefore provided, in accordance with a
preferred embodiment of the present invention, in a
cluster of computing nodes having shared access to one or
more volumes of data storage using a parallel file
system, a method for managing the data storage,
including:

initiating a session of a data management (DM)
application on a first one of the nodes;

running a user application on a second one of the
nodes;

receiving a request submitted to the parallel file
system by the user application on the second node to
perform a file operation on a file in one of the volumes
of data storage; and

sending a DM event message from the second node to
the first node responsive to the request, for processing
by the data management application on the first node.

Preferably, initiating the session includes
initiating the session in accordance with a data
management application programming interface (DMAPI) of
the parallel file system, and receiving the request
includes processing the request using the DMAPI. Further
preferably, the method includes receiving and processing
the event message at the first node using one or more
functions of the DMAPI called by the data management
application. Additionally or alternatively, sending the
event message includes sending the message for processing
in accordance with a disposition specified by the data
management application using the DMAPI for association
with an event generated by the file operation.

Preferably, the method further includes receiving a
response to the event message from the data management
application on the first node, and performing the file
operation requested by the user application on the second
node subject to the response from the data management
application on the first node. Further preferably,
receiving the request includes submitting the request
using a file operation thread running on the second node,
and blocking the thread until the response to the event
message is received from the first node. Additionally or
alternatively, sending the event message includes passing
the event message from a source physical file system
(PFS) on the second node to a session PFS on the first
node, and receiving the response includes passing a
response message from the session PFS to the source PFS.
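
By way of illustration only, the blocking of the file
operation thread may be sketched in Python using two queues to
stand in for the messages passed between the source PFS and
the session PFS. The function names, the message layout, and
the allow-reads policy are assumptions made for this sketch.

```python
import queue
import threading

def session_pfs(requests):
    """Session node: answer each event message with a response."""
    while True:
        item = requests.get()
        if item is None:          # shutdown marker used only by this sketch
            break
        event, reply = item
        # Illustrative policy: allow reads, deny everything else.
        reply.put("continue" if event["type"] == "read" else "abort")

def file_operation(requests, event):
    """Source node: send the event, then block until the response arrives."""
    reply = queue.Queue(maxsize=1)
    requests.put((event, reply))
    response = reply.get()        # the file operation thread blocks here
    return "done" if response == "continue" else "denied"

requests = queue.Queue()
worker = threading.Thread(target=session_pfs, args=(requests,))
worker.start()
read_result = file_operation(requests, {"type": "read", "path": "/fs/a"})
write_result = file_operation(requests, {"type": "write", "path": "/fs/a"})
requests.put(None)
worker.join()
```

The call to reply.get() is the point at which the requesting
thread waits; the file operation is carried out or refused
only after the response has been received.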

In a preferred embodiment, the method includes
receiving the event message at the first node, obtaining
a data management access right from a physical file
system (PFS) at the first node responsive to the event
message, and processing the event message using the
access right.

In a further preferred embodiment, receiving the
request includes receiving first and second requests of
different types submitted to a physical file system (PFS)
at the second node, wherein based on the different
request types, sending the event message includes sending
a first event message to the first node responsive to the
first request, and sending a second event message
responsive to the second request to a further node, on
which a further data management application session has
been initiated. Preferably, sending the first and second
event messages includes receiving at the second node a
specification of event types and their respective
dispositions, the event types corresponding to the
requests to perform the file operations, and dispositions
indicating which of the event messages should be sent to
which of the nodes, and sending the messages responsive
to the specification.
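
By way of illustration only, a specification of event types
and dispositions may be represented as a simple lookup table,
as in the following Python sketch. The table layout, the
function name route_event, and the node names are hypothetical.

```python
# Hypothetical disposition table: event type -> node holding the session
# that registered for that event type.
dispositions = {
    "read": "node1",     # first DM session
    "destroy": "node4",  # a further DM session on a further node
}

def route_event(event_type, source_node, dispositions):
    """Return (destination node, message) for an event, or None."""
    destination = dispositions.get(event_type)
    if destination is None:
        return None  # no session has a disposition for this event type
    return destination, {"type": event_type, "source": source_node}

first = route_event("read", "node2", dispositions)
second = route_event("destroy", "node2", dispositions)
unrouted = route_event("write", "node2", dispositions)
```

Two requests of different types submitted on node2 are thus
directed to sessions on two different nodes.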

In yet a further preferred embodiment, running the
user application includes running a first user
application instance on the second node, and running a
further user application instance on a further one of the
nodes, and the method includes receiving a further
request submitted to the parallel file system by the
further user application instance to perform a further
file operation, and sending a further event message
responsive to the further request for processing by the
data management application on the first node.
Optionally, the further one of the nodes is the first
node.

In still a further preferred embodiment, initiating
the session of the data management application includes
initiating a data migration application, so as to free
storage space on at least one of the volumes of data
storage.

Preferably, the method includes choosing one of the
nodes to act as a session manager node, wherein
initiating the session includes sending a message from
the session node to the session manager node, causing the
session manager node to distribute a specification of
events and respective dispositions of the events for the
session among the nodes in the cluster, and wherein
sending the DM event message includes sending the message
in accordance with the dispositions.

In a preferred embodiment, one of the nodes is
appointed to serve as a respective file system manager
for each of one or more file systems in the cluster, and
for each of the file systems, the session manager node
sends the specification of the dispositions applicable to
the file system to the respective file system manager,
which sends the dispositions to all of the nodes in the
cluster on which the file system is mounted.
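
By way of illustration only, the two-stage distribution of
dispositions (session manager to file system manager, then
file system manager to the mounting nodes) may be sketched in
Python as follows. The function name and the dictionary
layouts are assumptions made for this sketch.

```python
def distribute_dispositions(dispositions, fs_managers, mounts):
    """Model the two-stage fan-out of dispositions.

    dispositions: file system -> {event type: session node}
    fs_managers:  file system -> node acting as its file system manager
    mounts:       file system -> set of nodes that mounted it
    Returns node -> {file system: dispositions} as seen after distribution.
    """
    view = {}
    for fs, disp in dispositions.items():
        fsm = fs_managers[fs]
        # Stage 1: session manager -> file system manager of this file system.
        view.setdefault(fsm, {})[fs] = dict(disp)
        # Stage 2: file system manager -> every node mounting the file system.
        for node in mounts.get(fs, set()):
            view.setdefault(node, {})[fs] = dict(disp)
    return view

view = distribute_dispositions(
    dispositions={"fsA": {"read": "node1"}, "fsB": {"write": "node1"}},
    fs_managers={"fsA": "node2", "fsB": "node3"},
    mounts={"fsA": {"node2", "node4"}, "fsB": {"node3"}},
)
```

A node receives dispositions only for the file systems that it
manages or has mounted; in the example, node4 learns nothing
about fsB.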

Preferably, sending the DM event message includes 
incorporating in the message a data field uniquely 
identifying the second node. 

Further preferably, upon receiving from one of the
nodes other than the first one of the nodes a call for a
data management application programming interface (DMAPI)
function in connection with the session, the function is
performed only if it does not change a state of the
session or of an event associated with the session.
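
By way of illustration only, this restriction may be sketched
in Python as shown below. The function names are taken from
the XDSM standard, but their classification as state-changing
or not is an assumption made for this sketch, as is the helper
call_dmapi itself.

```python
# Hypothetical classification; which DMAPI functions change state is an
# assumption made for this sketch.
STATE_CHANGING = {"dm_respond_event", "dm_set_disp", "dm_destroy_session"}

def call_dmapi(function, calling_node, session_node):
    """Permit a call from a non-session node only if it is read-only."""
    if calling_node != session_node and function in STATE_CHANGING:
        return "rejected"
    return "performed"

remote_query = call_dmapi("dm_query_session", "node2", "node1")
remote_respond = call_dmapi("dm_respond_event", "node2", "node1")
local_respond = call_dmapi("dm_respond_event", "node1", "node1")
```

A query issued from a node other than the session node is
performed, whereas a response to an event is accepted only
when it is issued on the session node.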

There is also provided, in accordance with a
preferred embodiment of the present invention, computing
apparatus, including:

one or more volumes of data storage, arranged to
store data; and

a plurality of computing nodes, linked to access the
volumes of data storage using a parallel file system, and
arranged so as to enable a data management (DM)
application to initiate a data management session on a
first one of the nodes, while allowing a user application
to run on a second one of the nodes, so that, when the
user application submits a request to the parallel file
system on the second node to perform a file operation on
a file in one of the volumes of data storage, a DM event
message is sent from the second node to the first node
responsive to the request, for processing by the data
management application on the first node.

There is additionally provided, in accordance with a
preferred embodiment of the present invention, a computer
software product for use in a cluster of computing nodes
having shared access to one or more volumes of data
storage using a parallel file system, the product
including a computer-readable medium in which program
instructions are stored, which instructions, when read by
the computing nodes, cause a session of a data management
(DM) application to be initiated on a first one of the
nodes, while allowing a user application to run on a
second one of the nodes, and in response to a request
submitted to the parallel file system by the user
application on the second node to perform a file
operation on a file in one of the volumes of data
storage, cause the second node to send a DM event message
to the first node, for processing by the data management
application on the first node.

The present invention will be more fully understood
from the following detailed description of the preferred
embodiments thereof, taken together with the drawings in
which:

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram that schematically
illustrates a cluster of computing nodes with a parallel
file system, in accordance with a preferred embodiment of
the present invention;

Fig. 2 is a block diagram that schematically shows
details of the parallel file system of Fig. 1, in
accordance with a preferred embodiment of the present
invention;

Fig. 3 is a flow chart that schematically
illustrates a method for handling a DMAPI event generated
by a file operation in a parallel file system, in
accordance with a preferred embodiment of the present
invention;

Fig. 4 is a flow chart that schematically
illustrates a method for unmounting an instance of a
parallel file system, in accordance with a preferred
embodiment of the present invention;

Fig. 5 is a flow chart that schematically
illustrates a method for handling a DMAPI function call
in a parallel file system, in accordance with a preferred
embodiment of the present invention; and

Fig. 6 is a flow chart that schematically
illustrates a method for handling a DMAPI session failure
in a parallel file system, in accordance with a preferred
embodiment of the present invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

GLOSSARY 

The following is a non-exhaustive list of technical 
terms that are used in the present patent application and 
in the claims. The list is provided here for the 
convenience of the reader. Certain of the items in the 
list are specific to preferred embodiments of the present 
invention. These terms are described at greater length 
in the Detailed Description following the Glossary. 

• Cluster: A collection of computers interconnected 
by a communication mechanism, typically a 
high-performance switch or a network. The computers 
in the cluster collaborate with each other in 
computations and share data resources. 

• Node: A computer that is part of a cluster. Each 
node in the cluster has a cluster-wide unique 
identity. 

• File System: A hierarchical collection of files 
and file directories that are stored on disk and 
have an identified root, and are accessed using a 
predefined interface. Typically, such interfaces 
follow the prescription of standards, such as Posix. 
The term "file system" is also loosely used to 
describe the data and metadata contained in the file 
system. 

• File System Instance: A file system that is 
mounted on a computer. In a cluster, a given file 
system can have multiple file system instances, each 
instance mounted on a different node. 

• Physical File System (PFS): A software component
that manages collections of data on disks, typically
using mechanisms and interfaces prescribed by the
X/Open and Posix standards. The PFS is one of the
layers in the hierarchy of interfaces that are used
to support the file system and to enable software
applications to access the file data. Multiple
different PFSs may coexist on a computer, each used
to implement a different type of file system. The
PFS usually runs in the kernel, with possible
extensions running as daemons in user space.

• Parallel File System: A PFS running on a cluster
of nodes, which enables all nodes in the cluster to
access the same file data concurrently. Preferably,
all nodes share in the management of the file
systems. Any of the nodes can perform any role
required to manage the file systems, with specific
roles assigned to particular nodes as needed. The
term "parallel file system" is also loosely used to
describe a file system that is managed by a PFS
software component, as defined in this paragraph.

• DMAPI: Data Management Application Programming
Interface, as specified in the above-mentioned XDSM
standard. This term is also used to denote the
software sub-component implementing the interface in
the PFS.

• Session Node: A node in a cluster on which one or
more data management (DM) sessions have been
created. (The term "data management" is used as
defined in the XDSM standard.) The term is also
used to identify a specific node at which a specific
session exists.

• Source Node: A node in a cluster that generates DM
events. (The term "events" is similarly used as
defined in the XDSM standard.) The term is also
used to identify a specific node that generated a
specific event.

• Session Manager (SM): A node in a cluster that is
assigned the role of coordinating the creation and
maintenance of DM sessions and DM event dispositions
on the nodes in the cluster. Typically, the
sessions may be created on any of the nodes,
including the session manager node.

• File System Manager (FSM): A node in a cluster
that is assigned the role of managing the metadata
of a specific file system. The node coordinates
among all the instances of the file system that are
mounted on the different nodes in the cluster.

• Persistent Data: Data used by a software component
on a computer or a cluster of nodes, which is not
lost when the component crashes. Persistent data
can be recovered and restored when the component
recovers.

• Single Node Failure: A failure that occurs on a 
single node in a cluster. Services provided by 
other nodes in the cluster are normally not affected 
by this failure. The term is specifically used to 
describe the failure of the PFS software component. 
All file systems managed by the failing PFS become 
inaccessible on the failing node, but typically 
remain accessible on other nodes in the cluster. 

• Total PFS Failure: Failure of the PFS software
component on all the nodes in a cluster. All file
systems managed by the failing PFS become
inaccessible on all nodes in the cluster.

• File System Failure: Failure of a file system
instance on a single node, possibly due to
disconnection from disks or from other nodes in a
cluster. The failing file system instance becomes
inaccessible on the node where the failure occurred.
The term is also used to describe the failure of
multiple (or all) instances of the same file system
on any node in the cluster.

• Session Failure: Failure of the PFS on a session
node, at which one or more sessions exist for
monitoring file systems managed by the failing PFS.
Such sessions can no longer be used, and are hence
called failed sessions. When the PFS recovers, the
sessions can be recreated. The term is also used to
identify the failure of a specific session on a
specific node.

SYSTEM OVERVIEW 

Fig. 1 is a block diagram that schematically
illustrates a cluster 20 of computing nodes 22, running a
parallel file system, identified as a physical file
system (PFS) 28, in accordance with a preferred
embodiment of the present invention. Nodes 22 are
connected to one another and to multiple disks 24 by a
suitable communication mechanism, such as a high-capacity
switch or network (not shown). The disks may thus be
accessed in parallel by any of the nodes. Preferably,
[...] These applications may be either [...] advantage of
PFS 28 to access disks 24. A Data Management Application
Programming Interface (DMAPI) 26 is preferably integrated
into physical file system (PFS) 28. DM applications 32
use DMAPI 26 to track and control file [...] and to
manage file data of file systems in cluster 20, as
described in detail hereinbelow. For [...] configuration
manager 34 of PFS 28 also serves as a session manager
(SM) for DMAPI 26.

PFS 28 with DMAPI 26 is typically supplied as a software
package for installation on cluster 20, with or without a
complete operating system, such as AIX™. The software
may be downloaded to the cluster in electronic form, over
a network, for example, or it may alternatively be
supplied [...], such as CD-ROM, for installation on the
cluster nodes.

Fig. 2 is a block diagram that schematically shows
further details of the software structure and operation
of PFS 28, and particularly of DMAPI 26, in accordance
with a preferred embodiment of the present invention.
The figure shows a session node 40, [...] session manager
(SM) node 44, all of which are nodes [...]. [...] user
application 30 is running on node 42. Typically, both DM
application 32 and user application 30 may be distributed
applications, running on multiple nodes simultaneously,
and possibly running together on the same node. For the
sake of simplicity of illustration, however, and without
loss of generality, only a single node of each type is
shown in Fig. 2. Similarly, although SM 34 is shown as
running on a separate session manager node 44, it may
alternatively run on any of the nodes in cluster 20,
including session node 40 and source node 42.

In an alternative embodiment of the present
invention, not shown in the figures, DM sessions are
explicitly replicated on all nodes in the cluster. DM
events generated at a source node are then delivered to a
session at the source node itself. This embodiment
requires that DM applications be defined as multi-node
parallel applications, unlike DM applications known in
the art. Each event is handled by the DM application
instance on the source node at which it originated, while
consistency is maintained among the instances using
methods known in the art of parallel applications.

SESSIONS AND EVENT HANDLING 
Upon initiation of DM application 32, the
application creates a DMAPI session on session node 40.
Dispositions 49 of enabled events 46 are maintained on
source node 42 and on SM node 44. A list of enabled
events can be associated individually with a file and
globally with an entire file system. Conflicts between
individual and global event lists are preferably resolved
in favor of the individual list. Preferably, event lists
are persistent and are kept with the file system in
stable storage. Dispositions are not persistent and must
be set explicitly for each file system after PFS 28 is
started and sessions are created.
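
By way of illustration only, the resolution of conflicts
between the two kinds of event lists may be sketched in Python
as follows. The sketch assumes that a per-file list, when
present, replaces the file-system-wide list entirely; this is
one possible reading of "in favor of the individual list."
The function name and the file paths are hypothetical.

```python
def effective_events(global_events, per_file_events, path):
    """Return the enabled-event set that applies to `path`.

    A per-file list, when present, takes precedence over the list set
    for the whole file system (an assumption of this sketch).
    """
    if path in per_file_events:
        return per_file_events[path]
    return global_events

global_events = {"read", "write"}
per_file_events = {"/fs/archive.dat": {"read"}}

archive_events = effective_events(global_events, per_file_events, "/fs/archive.dat")
other_events = effective_events(global_events, per_file_events, "/fs/notes.txt")
```

In this example a write to /fs/archive.dat generates no event,
because the individual list for that file enables only reads,
while other files fall back to the global list.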

When user application [...] monitor different, respective
file systems. DM application 32 can set or change the
disposition [...] given file system from one session to
[...] any node at substantially any time. [...] updates
dispositions 49 in its [...] accordingly. [...] system
or after failure of the [...], it obtains the event
dispositions [...]. When additional nodes mount the file
system, they obtain the dispositions from the relevant
FSM. This approach guarantees that the FSM and all nodes
mounting the file system always have the most current
event [...], while maintaining efficiency of [...] only
to the nodes for [...].


[...] session [...] ensure that session identifiers (IDs)
[...] partitioning [...] the identity of the [...].

A new session is created by [...] function
dm_create_session() on [...] node, which sends a message
to [...]. The SM generates a session ID, adds the
session to its list of all sessions in the cluster, [...]
nodes in the cluster, and returns [...] to the session
node. The session [...] session node 40 and SM [...]
messages. Session IDs [...] cluster, [...] consist of a
time stamp and the node ID of SM 34. [...] of
globally-unique session IDs, which are [...].

[...] session, specifying the existing session ID. [...]
ID to SM [...] updates the session details in its [...].
An [...] session will be assumed in this manner only
after a session node failure, as described below. The
session ID does not change when an existing session is
[...]. dm_create_session() can also be used to modify
the session information string of an existing session.
[...] can be made only on the [...].

[...] nodes 22 by [...] dm_set_disp() on the session
node. DMAPI 26 informs SM 34, and the SM keeps track of
the sessions that are [...]. When a node [...]
operation, it obtains from the SM a list of sessions that
are registered for the mount event.

[...] sends a message to SM 34, which removes the session
from its list and broadcasts the change to all nodes.
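
By way of illustration only, the role of the session manager
in creating and destroying sessions may be sketched in Python
as follows. The SessionManager class and its methods are
hypothetical, the session ID is modeled as a (time stamp, SM
node ID) pair, and the broadcast is modeled simply by copying
the session list to a per-node replica.

```python
import itertools

class SessionManager:
    """Hypothetical session manager (SM) keeping the cluster-wide list."""

    def __init__(self, sm_node_id, nodes, clock):
        self.sm_node_id = sm_node_id
        self.clock = clock                      # callable returning a time stamp
        self.sessions = {}                      # session ID -> session node
        self.replicas = {n: {} for n in nodes}  # per-node copy of the list

    def _broadcast(self):
        # Model of the broadcast: every node receives the current list.
        for replica in self.replicas.values():
            replica.clear()
            replica.update(self.sessions)

    def create_session(self, session_node):
        # ID built from a time stamp and the node ID of the SM.
        session_id = (self.clock(), self.sm_node_id)
        self.sessions[session_id] = session_node
        self._broadcast()
        return session_id

    def destroy_session(self, session_id):
        del self.sessions[session_id]
        self._broadcast()

ticks = itertools.count(1000)  # stand-in clock so the example is repeatable
sm = SessionManager("node44", ["node40", "node42", "node44"], lambda: next(ticks))
sid_a = sm.create_session("node40")
sid_b = sm.create_session("node42")
sm.destroy_session(sid_a)
```

After the second session is created and the first destroyed,
every node's replica lists only the surviving session.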

[...] efficient event generation [...] recovery from PFS
failure, the session and event information is replicated
on multiple nodes in cluster 20. Preferably, as shown in
Fig. 2, information regarding [...] is maintained on both
[...]. [...] session data 48 are [...] session manager
node 44. [...] partial information [...] the cluster,
including session ID, session node address, whether or
not the session is registered for the mount event, and a
short prefix of the session information string.
Dispositions 49 are maintained by SM 34 on node 44 and on
[...] the monitored file system is mounted, [...] source
[...].

SM 34 is responsible for disseminating changes in session
details. When the SM is notified of a change in the
state of one of the sessions, it broadcasts the change to
all of the nodes in the cluster.

Replication of the [...] events to be generated
efficiently, without repeatedly communicating session and
disposition information between nodes. It also supports
efficient recovery from [...] failure without [...].
[...] methods [...] described hereinbelow with reference
to Fig. [...].



Fig. 3 is a flow chart that schematically illustrates a
method for handling a DMAPI event generated in cluster 20,
in accordance with a preferred embodiment of the present
invention. The method begins when user application 30
invokes a file operation on source node 42, at an
invocation step 50. It is assumed that this operation
generates an event that appears in the list of enabled
events and in dispositions 49, which are furnished by
SM 34 to the nodes in cluster 20 that have mounted
instances of the file system in question. Typically, if
the event is enabled but there is no disposition for it,
DMAPI 26 returns an error message to the user application.



Event generation is preferably implemented in the virtual
file system (VFS) interface layer of PFS 28. The
definition of the VFS layer and its interaction with the
PFS are well known in the art of UNIX-type operating
systems, including the above-mentioned AIX operating
system. The integration of DMAPI 26 with PFS 28 includes
augmenting the file operations in the PFS with code for
event generation. At an event generation step 52, this
code causes the file operation client thread invoked at
step 50 to generate the prescribed event. If the event is
an asynchronous event, PFS 28 on source node 42 sends an
appropriate event message to the session on session
node 40, and the requested file operation is immediately
free to proceed on source node 42. In the example shown
in Fig. 3, however, the event is assumed to be a
synchronous event, which causes the file operation thread
to block and await a response from the DM application
before continuing with PFS processing.

PFS 28 on source node 42 sends an event message to PFS 28
on session node 40, in accordance with the specified event
disposition, at an event sending step 54. The event
message header preferably carries a field, ev_nodeid,
which is added to the dm_eventmsg structure defined in the
above-mentioned XDSM specification in order to identify
the source node, since events can be generated at any node
in cluster 20. In implementations based on GPFS in the SP
environment, the node identifier is preferably its System
Data Repository (SDR) node number. The event message is
enqueued at session node 40.

DM application 32 on node 40 receives and handles the
event, at an event handling step 56. For this purpose,
the DM application makes use of function calls provided by
DMAPI 26, such as dm_get_events(), as specified by the
XDSM standard. These function calls are implemented as
kernel calls from the user space of the DM application
into PFS 28, based on linking the DM application with an
interface library of DMAPI function calls. DM function
calls enter the PFS kernel on the DM application thread.
The processing may involve additional PFS daemon threads
and may proceed both in user and kernel space.

After DM application 32 has processed the event, it
generates its response to the event, at a response
step 58, using the DMAPI function call
dm_respond_event(). Session node 40 sends the response
back to source node 42, at a response sending step 60.
The PFS on node 42 passes the event response to the file
operation thread, at a response reception step 62. If
the response indicates that the operation should be
aborted, the file operation returns to user application 30
without further PFS processing; otherwise, the file
operation continues its PFS processing until completion,
at a continuation or aborting step 64.
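The event flow of Fig. 3 can be illustrated with a small
single-threaded Python model. It is a sketch under stated
assumptions rather than the PFS code: the function
file_operation, the handler callback and the CONTINUE and
ABORT values are hypothetical, and blocking is simulated by
draining the session queue until the caller's own event
has been answered. Only the ev_nodeid field name comes
from the text.

```python
from collections import deque

CONTINUE, ABORT = "continue", "abort"   # hypothetical response values


def file_operation(queue, source_node, synchronous, dm_handler):
    """Raise one event for a file operation; return 'done' or 'aborted'."""
    msg = {"ev_nodeid": source_node, "sync": synchronous}
    queue.append(msg)                    # enqueue at the session node
    if not synchronous:
        return "done"                    # asynchronous: proceed at once
    # Synchronous: the operation waits until the DM application has
    # responded to this particular event (simulated by draining).
    while queue:
        event = queue.popleft()
        response = dm_handler(event)     # DM application handles event
        if event is msg:
            return "aborted" if response == ABORT else "done"
```

An asynchronous event stays in the queue until the DM
application reaches it, whereas a synchronous event decides
whether the file operation continues or is aborted.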

MOUNTING AND UNMOUNTING FILE SYSTEM INSTANCES 

A mount event is generated each time a mount operation is
performed on one of the nodes in cluster 20. Similarly,
each unmount operation generates preunmount and unmount
events, assuming such events are enabled and have a
disposition. DM application 32 should therefore be
capable of handling multiple mount, preunmount and unmount
events, corresponding to multiple instances of the file
system that are mounted on multiple nodes. By contrast,
in single-node systems, as implied by the XDSM standard,
DM applications are not required to deal with more than a
single mount, preunmount or unmount event per file system.
As a result, in single-node systems [...] file system.

Without serialization of all mount and unmount operations
of the file system, there is no practical way to designate
the first or last mount or unmount. There need not even
be an equality between the number of mount events and the
number of preunmount or unmount events for a given file
system, since an unmount that is initiated internally
[...] does not generate any events. Therefore, DMAPI 26
requires methods for handling these events that extend
those provided by the XDSM standard.

To provide additional information to DM application 32,
two new flags, DM_LOCAL_MOUNT and DM_REMOTE_MOUNT, which
are not defined in the XDSM standard, are preferably added
to the mode fields in the message structures of mount,
preunmount and unmount events (the me_mode and ne_mode
fields, respectively). When DM_LOCAL_MOUNT is set, the
mount or unmount operation concerned is local to the
session node. When DM_REMOTE_MOUNT is set, the operation
is at a node that is remote from the session node. In
this case, the ev_nodeid field mentioned above can be used
by the session node to identify the source node on which
the mount or unmount operation is to be performed. These
flags are also used in the data structure returned by the
DMAPI function dm_get_mountinfo(). This function can be
called from any node, even if the file system is not
mounted on that node. At least one of the two flags will
be set in the data structure that is returned, as long as
the file system is mounted on one or more of the nodes in
cluster 20.

DM application 32 can make good use of the enhanced node
information provided by the dm_get_mountinfo() function
for processing of mount and preunmount events. For
example, before the DM application responds to a mount
event received from a node that is not the session node,
it can invoke dm_get_mountinfo() to determine whether the
relevant file system is already mounted locally at the
session node. If not, the DM application preferably
performs a local mount.

Mount events are preferably enqueued in the session queue
ahead of other events, in order to improve the response
time of PFS 28 to mount operations when the queue is busy.



Fig. 4 is a flow chart that schematically illustrates a
method for performing an unmount operation on source
node 42, in accordance with a preferred embodiment of the
present invention. This method is used whenever an
unmount operation is invoked, at an unmount invocation
step 70. DMAPI 26 on source node 42 generates a
preunmount event message and waits for the response from
the DM application, at an event generation step 72. This
message is received and processed by DMAPI 26 on session
node 40. DMAPI 26 on source node 42 receives the response
from node 40, at a response receipt step 74.

DMAPI 26 checks whether the unmount operation is a forced
unmount, at a force checking step 76. In such a case, any
outstanding access rights for the relevant file system on
source node 42 are released, at a cleanup step 78.
DMAPI 26 then permits the unmount to proceed, at an
unmounting step 80.

On the other hand, if the unmount is not a forced unmount,
DMAPI 26 checks whether it has received an "abort"
response from session node 40, at an abort checking
step 82. If so, the unmount operation is failed.
Similarly, the unmount is failed if there are still DM
access rights to the file system on source node 42, at an
access right checking step 84. In either of these cases,
an error code is set for the unmount operation on node 42,
at an error step 86. Only if there are no outstanding
access rights on node 42 can the unmount proceed normally
at step 80.

Whether the unmount was performed successfully (step 80)
or not (step 86), DMAPI 26 generates an unmount event and
waits for a response, at an unmount generation step 88.
After the response is received, any error code that was
set at step 86 is returned by the unmount operation.
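The decision order of the unmount method can be summarized
in a short Python sketch. The function name, the string
error labels and the send_event callback are hypothetical
illustrations, and the preunmount response is passed in as
an argument instead of being awaited.

```python
def unmount(forced, preunmount_response, outstanding_rights, send_event):
    """Return (unmounted, error) following the order of checks in Fig. 4."""
    send_event("preunmount")              # steps 72-74
    error = None
    if forced:
        outstanding_rights.clear()        # step 78: release access rights
        unmounted = True                  # step 80
    elif preunmount_response == "abort":  # step 82
        unmounted, error = False, "aborted-by-dm-application"
    elif outstanding_rights:              # step 84
        unmounted, error = False, "access-rights-held"
    else:
        unmounted = True                  # step 80
    send_event("unmount")                 # step 88: sent in every case
    return unmounted, error               # error from step 86, if any
```

Note that the unmount event is generated whether or not the
unmount succeeded, so a DM application in this model sees
the same pair of events in every case.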

DM ACCESS RIGHTS 

DM applications acquire DM access rights to file objects
and associate them with event tokens. DM access rights
are required in some of the DMAPI functions specified in
the XDSM standard (which are implemented in the multi-node
environment by the present invention). In order to avoid
overhead that would be incurred by managing access rights
distributively in the cluster setting, all of the access
rights associated with a given event are preferably
managed by the corresponding session node. Thus, all
requests to acquire, change, query or release DM access
rights must be invoked by the DM application on the
session node.

File operations must abide by DM access rights. In
particular, file operations that conflict with DM access
rights must be blocked while the access is held by the DM
application. Conversely, the DM application must be
prevented from acquiring an access right while a
conflicting file operation is in progress. Preferably,
these access rights are implemented using the internal
locking mechanisms of PFS 28, such as the GPFS locking
mechanisms described in the above-mentioned patents by
Schmuck et al.

DM access rights in cluster 20 are preferably treated as
an additional file lock in the hierarchy of locks acquired
during file access. This approach enables acquiring and
releasing access rights efficiently, using the existing
locking mechanisms of the PFS. This additional lock is
referred to herein as the "DM lock." The lock
characteristics are affected by the type of access (shared
or exclusive) and the type of thread acquiring the lock
(file operation thread or data management operation
thread). Existing file locks (such as those described by
Schmuck et al.) cannot be used for this purpose, since DM
access rights are held across multiple kernel calls and
can be shared among DM application threads without going
through the kernel. The existing file locks are still
required to synchronize access to file data, even while a
DM access right is held. Preferably, to prevent
deadlocks, the DM lock is acquired before any other locks
in the file locking hierarchy.

Table I below is a lock conflict table that defines the
semantics of DM access rights in the cluster. The
following lock modes are used:

DMX - data management exclusive access.

DMS - data management shared access.

FSX - file system exclusive access.

FSS - file system shared access.

An "X" in the table indicates a conflict between the
corresponding modes.
TABLE I - DM ACCESS RIGHTS 




DMX and DMS modes are used only in DM operations. They
provide exclusive and shared access rights, respectively,
for each individual DM operation, as defined by the XDSM
standard.

FSX and FSS modes are used in file operations, in order to
prevent DM applications from acquiring a DM access right
while a conflicting file operation is in progress. FSX
prevents acquisition of any DM access rights. FSS
prevents acquisition of exclusive DM access rights, but
does not conflict with shared DM access rights.
Typically, a file operation that modifies the data in a
file object or destroys the object will acquire a FSX
lock, whereas a FSS lock will suffice for other file
operations. There is no conflict between the FSX and FSS
modes, because file operations never compete with one
another for DM locks. This feature is important in
reducing the performance impact of the additional DM lock
in parallel file systems, since locking conflicts in such
systems are resolved by communication among multiple
nodes.
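The conflict rules stated in the preceding paragraphs can
be written down as a small Python predicate. This is a
sketch derived only from the prose above, not a copy of
Table I; the function name conflicts and the two helper
sets are hypothetical.

```python
DM_MODES = {"DMS", "DMX"}      # held by data management operations
EXCLUSIVE = {"DMX", "FSX"}     # exclusive flavor of each family
MODES = ["FSS", "FSX", "DMS", "DMX"]


def conflicts(a, b):
    """True if lock modes a and b conflict, per the prose rules."""
    # File operations never compete with one another for the DM lock.
    if a not in DM_MODES and b not in DM_MODES:
        return False
    # At least one side is a DM right: conflict if either is exclusive.
    return a in EXCLUSIVE or b in EXCLUSIVE


def conflict_table():
    """Render the derived matrix as text, with X marking a conflict."""
    rows = ["     " + " ".join(f"{m:>3}" for m in MODES)]
    for a in MODES:
        cells = " ".join(f"{'X' if conflicts(a, b) else '-':>3}"
                         for b in MODES)
        rows.append(f"{a:>4} {cells}")
    return "\n".join(rows)
```

Under these rules FSX blocks both DM modes, FSS blocks only
DMX, and two shared DM rights can coexist, which matches
the description of each mode given above.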

In the GPFS parallel file system, as described by Schmuck
et al., file locking is implemented using a token manager,
which grants lock tokens to nodes upon request and revokes
them when other nodes make conflicting requests. For the
DM lock, after the token manager grants a FSX or FSS token
to a node, there will be no need to revoke the token until
a DM application requests a DM lock on the file. For
files being used actively by a user application,
interference by DM applications is generally expected to
be minimal. The overhead added to normal file operations
by DM lock acquisitions and revocations should therefore
be small.

While a file operation holds its DM lock only for the
duration of the operation, a DM application can hold its
DM access right across many DMAPI function calls. This is
achieved by associating the access right in question with
an event token, which can be presented in subsequent DMAPI
calls.

DMAPI functions can also be invoked without presenting a
DM token, in which case DM access rights are acquired only
for the duration of the DMAPI function. For this purpose,
a FSS or FSX lock is sufficient, instead of the more
restrictive DMS or DMX lock. The tokenless DMAPI call
thus uses the same type of DM locking as do regular file
operations.

INVOKING DMAPI FUNCTIONS IN A CLUSTER 

In the preferred embodiments of the present invention, in
the multi-node environment of cluster 20, DMAPI functions
can be called from any of the nodes in the cluster.
Functions that do not change the state of a session or
event can be invoked freely at any node, in order to
enable DM applications to exploit the inherent parallelism
of the PFS. dm_punch_hole() and [...]. On the other
hand, DMAPI functions that change the state of a session
or event must be invoked on the session node.



Fig. 5 is a flow chart that schematically illustrates a
method for processing a DMAPI function call that provides
as parameters a session ID and event token, in accordance
with a preferred embodiment of the present invention.
This method is invoked when DMAPI 26 receives a function
call from DM application 32 on node 22, at a function
calling step 90. DMAPI 26 ascertains whether node 22 is
the session node (such as node 40) for the session, at a
session node determination step 92. If so, the required
function is carried out by PFS 28, in a function
performing step 100.

If the node invoking the function call is not the session
node for this session, DMAPI 26 determines whether this
function changes the state of the DMAPI event or session
(as specified by the XDSM standard), at a state change
determination step 96. If so, the requested function is
failed, and DMAPI 26 returns an error to DM
application 32, at a failure step 98.

On the other hand, if this is not the session node, and
the DMAPI function does not change the current event or
session state, the call proceeds in step 100 as long as
the session exists on some node, and the required event
token is presented. [...] the DM application on the
requesting node caches a copy of the token.
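The routing decision of Fig. 5 reduces to a few
comparisons, sketched here in Python. The function name
and its boolean parameters are hypothetical; in particular,
whether a given DMAPI function changes state is assumed to
be known to the caller.

```python
def route_dmapi_call(local_node, session_node, changes_state,
                     session_exists=True, token_presented=True):
    """Return 'perform' or 'error' for a call carrying a session ID."""
    if local_node == session_node:
        return "perform"          # step 92 -> step 100
    if changes_state:
        return "error"            # step 96 -> step 98
    # Not the session node, and no state change is requested.
    if session_exists and token_presented:
        return "perform"          # step 100
    return "error"
```

For example, a state-changing call issued on node 41 for a
session that lives on node 40 is rejected, while a
non-state-changing call on the same node proceeds.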

DMAPI FAILURE AND RECOVERY MECHANISMS 
The failure model defined in the XDSM standard is geared
to a single-node system, in which two types of
DMAPI-related failures may occur: DM application failure
or total PFS failure. When only the DM application fails,
DMAPI resources, such as sessions and events, remain
intact. As a result, file systems may become unstable,
since there may be pending events and blocked user
threads, waiting for response by the DM application. To
deal with this situation, the DM application must restart
and pick up any existing session where it left off. For
this purpose, the XDSM standard provides DMAPI functions
that enable the restarted DM application to query the
session queue and handle any pending events.

Recovery from total PFS failure is a matter for the PFS to
handle and is beyond the scope of the XDSM standard. When
the PFS fails, all non-persistent DMAPI resources are
lost. The PFS is expected to clean up its own state when
it is restarted. The DM application can then restart as
well. Since sessions are not persistent, there is no need
in this case for session recovery.

These two models do not describe all of the possible
types of failure that may occur in a multi-node parallel
file system, as is used in preferred embodiments of the
present invention. The multi-node system should also be
capable of dealing with single-node failures, in which a
file system instance or the PFS may fail on one or more of
the nodes, while continuing to work on the others. In
such a case, DMAPI 26 should also continue to operate and
enable file access on the surviving nodes, while the PFS
recovers on the failed nodes. A distributed DM
application may likewise continue running on some nodes,
while other nodes (possibly including the session node for
the DM application) have failed and are in the process of
recovery.

Single-node failure may occur in the multi-node system
either when a specific file system becomes inaccessible on
the node or when the entire PFS fails on the node. In the
latter case, all file system instances that are managed by
the PFS become inaccessible on that node. Handling of and
recovery from a single-node failure depend on whether the
failed node is a source node or a session node. When
source node 42 fails (Fig. 2), events generated by that
node become obsolete. If such events were already
enqueued at session node 40, DM application 32 will
continue to process the events. The processing may be
unnecessary, since there is no longer any file operation
waiting for the response, but is harmless aside from the
attendant loss in efficiency.

Session node failures are more difficult to handle. When
a session node fails, all DMAPI resources, including all
sessions, are lost on the failing node, although not on
other, surviving nodes. File operations on the surviving
nodes may still be blocked, however, waiting for response
from an event previously sent to a failed session. It is
therefore important to recover the session, possibly on
another node, and to resume handling of pending events, so
that file operations on the surviving nodes will be able
to continue without failure.
Fig. 6 is a flow chart that schematically illustrates a
method for dealing with session failure on
node 40, in accordance with a preferred embodiment of the 
present invention. This type of failure scenario is not 
addressed by the XDSM standard. The present method is 
needed to ensure that events generated at any of the 
nodes in cluster 20 can be handled in the proper manner, 
so that user threads that generated those events can 
accordingly be unblocked. 

Session failure is detected at a session failure
step 110, preferably by a "heartbeat" or group service
that checks connectivity in cluster 20, as is known in
the art. Session manager (SM) 34 plays an important role
during session recovery. When session node 40 fails,
SM 34 is notified, at a notification step 112. The SM
marks the session as failed, but keeps all of the session
details.

Recovery following the session failure is triggered 
by DM application 32, at a triggering step 114. There are
two ways of triggering session recovery, depending on
whether the DM application itself has also failed, or
only the PFS has failed:

• Explicit recovery - if the DM application failed
at the same time as the session (due to a node crash,
for example), it must be restarted, possibly on
another node. The restarted DM application
explicitly assumes the old session, using the DMAPI
function dm_create_session() and specifying the
session ID. Assumption of the failed session
triggers reconstruction of the session queue and the
events on it on the new session node, as described
below. The DM application can then continue
handling events. If the DM application survives, it
may notice that the session has failed, on account
of the error codes that it receives in response to
DMAPI function calls. In this case, the application
may wait for the PFS to recover on the failed node,
or it may alternatively move to another node and
assume the failed session, as described above.
• Implicit recovery - since the DM application
executes as a separate process, independent of the
PFS, it is possible that the PFS will recover
before the DM application has even noticed the
failure. In this case, session queue reconstruction
is triggered implicitly when the DM application
invokes any DMAPI function for the session at the
session node. Explicit session assumption is
unnecessary in this situation.

Calling dm_create_session() and supplying the
session ID (explicit recovery) or calling any DMAPI
function for the session (implicit recovery) causes
DMAPI 26 on the new session node to contact SM 34, at a
contact step 116. The SM records the new session node and
session information string, if any, and changes the
session state from failed to valid. It broadcasts the
updated session details to all nodes, at a broadcast
step 118.
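The SM side of session recovery can be sketched in Python
as two small state transitions. The dictionary layout and
the function names mark_node_failed and assume_session are
hypothetical, and the broadcast callback stands in for the
message sent to all nodes.

```python
def mark_node_failed(sessions, failed_node):
    """On notification of a node failure, flag its sessions as failed.

    The session details are kept so the session can be assumed later.
    """
    for rec in sessions.values():
        if rec["node"] == failed_node:
            rec["state"] = "failed"


def assume_session(sessions, sid, new_node, info=None, broadcast=print):
    """Record the new session node and mark the session valid again."""
    rec = sessions[sid]
    rec["node"] = new_node
    if info is not None:           # session information string, if any
        rec["info"] = info
    rec["state"] = "valid"
    broadcast(sid, dict(rec))      # updated details go to all nodes
    return rec
```

Both explicit and implicit recovery would end in the same
assume_session transition in this model; they differ only
in which DMAPI call on the new session node triggers it.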
While recovering the session, it is necessary to
reconstruct the session queue at the new session node.
Preferably, in order to reconstruct the session queue,
session node 40 broadcasts a request to all of the nodes
in cluster 20, at a request step 119. Upon receiving
this request, the surviving nodes resubmit any pending
synchronous events they may have to the new session node,
at a resubmission step 120. This step causes the event
tokens to be regenerated with the same IDs as they had
before the failure. Certain events may not be
recoverable by DMAPI 26 following a session failure.
Asynchronous events are lost with no harm done. Events
that originated from the failed session node, including
user events, cannot be recovered by the resubmission
mechanism described above.
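Queue reconstruction by resubmission can be modeled as
follows. This Python sketch assumes a hypothetical mapping
from each node to the events it still has pending; the
function name and the dictionary keys other than ev_nodeid
are illustrative only.

```python
def rebuild_session_queue(pending_by_node, failed_node):
    """Rebuild the session queue from events resubmitted by survivors."""
    queue = []
    for node, events in sorted(pending_by_node.items()):
        if node == failed_node:
            continue               # events from the failed node are lost
        for ev in events:
            if not ev["sync"]:
                continue           # asynchronous events are not resubmitted
            queue.append({
                "token": ev["token"],   # same token ID as before the failure
                "ev_nodeid": node,
                "state": "initial",     # resubmitted events start over
            })
    return queue
```

The rebuilt entries carry no access rights, so a DM
application working from this queue has to reacquire any
rights it needs.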

Session failure results in the loss of resources
associated with the events in the session queue,
including DM access rights. These resources are not
recovered simply by resubmitting the events. As a
result, DMAPI functions may fail after session recovery
due to invalid DM access rights. Furthermore, DMAPI 26
cannot determine after recovery which events were already
being handled by the DM application prior to the failure,
nor can it guarantee that none of the files in question
were accessed or modified before the failure. All events
resubmitted after a failure revert to the initial
(non-outstanding) state. Similarly, when only a file
system instance fails at the session node, all DM access
rights for files in the file system are lost, although
the events and tokens remain. After the file system
instance is remounted, the DM application must reacquire
the access rights. There is no guarantee that objects
have not been modified while the access rights were not
held. Therefore, DM application 32 should be written so 
as to recover consistently from the loss of DM access 
rights, notwithstanding the associated loss of 
information. For example, the DM application could be 
programmed to keep its own state of events in progress or 
to implement an appropriate consistency protocol, as is 
known in the art. 

It is desirable to provide mechanisms that will 
speed up recovery from a session failure and will prevent 
indefinite blocking of user applications. Preferably, if 
a failed session does not recover after a predetermined 
lapse of time, pending events are aborted at source node 
42, and file operations associated with the events are 
failed. User applications 30 can then retry the failed
operations as appropriate. 
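The time-out rule for a session that does not recover can
be sketched in a few lines of Python. The function
expire_pending and its parameters are hypothetical, and
times are plain numbers so that the example is
deterministic.

```python
def expire_pending(pending, now, failed_at, timeout):
    """Abort pending events once a failed session has been down too long.

    Returns the list of aborted event tokens; their file operations
    would be failed so that user applications can retry them.
    """
    if failed_at is None or now - failed_at < timeout:
        return []                  # session is healthy, or still in grace
    aborted = list(pending)
    pending.clear()                # nothing remains blocked at the source
    return aborted
```

A source node could call such a check periodically, which
bounds how long a user thread stays blocked on a session
that never comes back.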

It may also occur that SM 34 fails, in which case a 
new SM is appointed. The new SM must recover all of the 
information that was maintained by the previous SM. For 
this purpose, the new SM preferably broadcasts a request 
to all nodes for session information. Each node responds 
by sending the SM a list of all the sessions existing on 
that node and the event dispositions for all the file 
systems that are mounted at the node. The new SM uses 
this information to rebuild the collective session and
disposition information. 
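The rebuilding step performed by a newly appointed SM can
be illustrated with the following Python sketch. The shape
of the per-node replies and the function name
rebuild_sm_state are assumptions made for the example.

```python
def rebuild_sm_state(replies):
    """Merge per-node replies into the new SM's session and disposition maps.

    replies maps node -> {"sessions": {sid: details},
                          "dispositions": {file_system: disposition}}.
    """
    sessions, dispositions = {}, {}
    for node, reply in replies.items():
        for sid, details in reply["sessions"].items():
            # A session is reported by the node on which it exists.
            sessions[sid] = dict(details, node=node)
        for fs, disp in reply["dispositions"].items():
            # Any node that mounted the file system can supply these.
            dispositions.setdefault(fs, disp)
    return sessions, dispositions
```

Sessions that existed only on the failed SM node do not
appear in any reply, which is why they need the separate
treatment described in the following paragraphs.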

The only information that may not be fully recovered 
in this manner concerns sessions that existed on the
failed SM node. As indicated above, each node preferably
keeps partial information on every session. Most of the
missing information can therefore be retrieved locally at
the new SM node, from the list of all sessions that was
maintained at that node. The only unrecoverable

information regarding sessions that existed on the failed 
SM node is the session information string. As noted 
above, typically only a prefix of this string is 
maintained on nodes other than the session node itself. 

The new SM reconstructs event dispositions based on 
information that it receives from the file system 
managers (FSMs) 36 in cluster 20. If a FSM has failed, 
the corresponding information is recovered from one of 
the other nodes that mounted the file system in question. 
If all of the nodes that mounted the file system have 
failed, however, the dispositions are lost, and new 
dispositions for the file system will have to be set when 
the file system is mounted on some node. 
After reconstructing the dispositions, the new SM
must make sure that the information it now has is
consistent with all of the nodes. For this purpose, the
SM sends its reconstructed dispositions to the FSMs,
which in turn send them to all the nodes that have mounted
corresponding file system instances. If some file system 
has no FSM (typically due to FSM failure), the 
dispositions are held at the SM and are sent to the new 
FSM when one is appointed. 

Although preferred embodiments are described 
hereinabove with reference to a particular configuration
of cluster 20 and parallel file system 28, it will be
appreciated that the principles embodied in DMAPI 26 are
similarly applicable in other parallel file system
environments, as well. It will thus be understood that
the preferred embodiments described above are cited by
way of example, and that the present invention is not
limited to what has been particularly shown and described
hereinabove. Rather, the scope of the present invention
includes both combinations and subcombinations of the
various features described hereinabove, as well as
variations and modifications thereof which would occur to
persons skilled in the art upon reading the foregoing
description and which are not disclosed in the prior art.